Context:
The purpose is to classify a given silhouette as one of three types of vehicle, using a set of features extracted from the silhouette. The vehicle may be viewed from one of many different angles.
Attribute Information:
● All the features are geometric features extracted from the silhouette.
● All are numeric in nature.
Objective:
Apply a dimensionality reduction technique (PCA) and train a model using principal components instead of training the model on just the raw data.
#importing libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.impute import SimpleImputer
from scipy.stats import iqr
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix,classification_report,accuracy_score
from sklearn import metrics
from sklearn.model_selection import KFold, cross_val_score
from scipy.stats import zscore
from sklearn.decomposition import PCA
#loading data
df=pd.read_csv('vehicle.csv')
df.head()
shape=df.shape #Provides the Shape in (Rows, Columns) in the Data Frame df
print('shape of the data frame is =',shape)
#Column names
df.columns
Attribute Information
COMPACTNESS: (average perim)**2/area
CIRCULARITY: (average radius)**2/area
DISTANCE CIRCULARITY: area/(av. distance from border)**2
RADIUS RATIO: (max.rad - min.rad)/av.radius
PR.AXIS ASPECT RATIO: (minor axis)/(major axis)
MAX.LENGTH ASPECT RATIO: (length perp. to max length)/(max length)
SCATTER RATIO: (inertia about minor axis)/(inertia about major axis)
ELONGATEDNESS: area/(shrink width)**2
PR.AXIS RECTANGULARITY: area/(pr.axis length * pr.axis width)
MAX.LENGTH RECTANGULARITY: area/(max.length * length perp. to this)
SCALED VARIANCE ALONG MAJOR AXIS: (2nd order moment about minor axis)/area
SCALED VARIANCE ALONG MINOR AXIS: (2nd order moment about major axis)/area
SCALED RADIUS OF GYRATION: (mavar + mivar)/area
SKEWNESS ABOUT MAJOR AXIS: (3rd order moment about major axis)/sigma_min**3
SKEWNESS ABOUT MINOR AXIS: (3rd order moment about minor axis)/sigma_maj**3
KURTOSIS ABOUT MINOR AXIS: (4th order moment about major axis)/sigma_min**4
KURTOSIS ABOUT MAJOR AXIS: (4th order moment about minor axis)/sigma_maj**4
HOLLOWS RATIO: (area of hollows)/(area of bounding polygon)
where sigma_maj**2 is the variance along the major axis, sigma_min**2 is the variance along the minor axis, and
area of hollows = area of bounding polygon - area of object
The area of the bounding polygon is found as a side result of the computation to find the maximum length. Each individual length computation yields a pair of calipers to the object oriented at every 5 degrees. The object is propagated into an image containing the union of these calipers to obtain an image of the bounding polygon.
#dataframe information
df.info()
df['class'].value_counts()
#Converting the Data type for the Categorical attributes from object to Category data type
df = df.astype({"class":'category'})
i = 0
#Length of the columns of the data frame
n=len(df.columns)
#List of all the attributes in the data frame
List=list(df.columns.values)
print('Data type of each attribute of Data frame:\n')
while i < n:
    New_List=List[i]
    Data_type=df[New_List].dtype
    print('Data Type of',New_List,'attribute is:',Data_type)
    i=i+1
le = LabelEncoder()
df['class'] = le.fit_transform(df['class'])
df['class']
df['class'].value_counts()
df.info()
print('Checking the presence of missing values in the Data frame:\n')
null_value_count = df.isnull().sum()
i = 0
#Length of the columns of the data frame
n=len(df.columns)
#List of all the attributes in the data frame
List=list(df.columns.values)
while i < n:
    New_List=List[i]
    print('There are',null_value_count[i],'null values in',New_List,'attribute in the dataframe')
    i=i+1
A few values are missing in most of the attributes. We assume these values are missing at random and replace them with the column median.
#Stats of the dataframe
df.describe().T
df=df.replace('', np.nan)
newdf=df.copy()
X = df.iloc[:,0:19]
imputer = SimpleImputer(missing_values=np.nan, strategy='median')
transformed_values = imputer.fit_transform(X)
column = X.columns
print(column)
newdf = pd.DataFrame(transformed_values, columns = column )
newdf.describe().T
The imputed values are stored in a new dataframe, newdf, leaving df unchanged.
print('Checking the presence of missing values in the New Data frame:\n')
null_value_count = newdf.isnull().sum()
i = 0
#Length of the columns of the data frame
n=len(newdf.columns)
#List of all the attributes in the data frame
List=list(newdf.columns.values)
while i < n:
    New_List=List[i]
    print('There are',null_value_count[i],'null values in',New_List,'attribute in the dataframe')
    i=i+1
print('Measure of skewness of Quantitative Data in the New Dataframe newdf')
i = 0
List=list(newdf.columns.values)
n=len(List)
while i < n:
    New_List=List[i]
    skew=newdf[New_List].skew(axis = 0, skipna = True)
    if (skew==0):
        conclusion='Data is normally distributed or Symmetric'
    elif(skew<0):
        conclusion='Data is Left-Skewed'
    else:
        conclusion='Data is Right-Skewed'
    print('Skewness of',New_List,'is: %.3f'%skew,'and',conclusion)
    i=i+1
print('Checking the presence of outliers of Quantitative Data in the New Dataframe newdf')
i = 0
total_outliers=0
List=list(newdf.columns.values)
n=len(List)
while i < n:
    New_List=List[i]
    minimum,q1,q3,maximum= np.percentile(newdf[New_List],[0,25,75,100])
    iqr=q3-q1
    lower_value=q1-(1.5 * iqr)
    upper_value=q3+(1.5 * iqr)
    if ((minimum<lower_value) or (maximum>upper_value)):
        outliers = [x for x in newdf[New_List] if x < lower_value or x > upper_value]
        print('Identified outliers for',New_List,'out of', len(newdf[New_List]),'records: %d' % len(outliers))
        total_outliers=total_outliers+len(outliers)
    else:
        print('There is no outlier for the attribute',New_List)
    i=i+1
print('Total number of outliers are:',total_outliers)
The following columns have outliers: radius_ratio, pr.axis_aspect_ratio, max.length_aspect_ratio, scaled_variance, scaled_variance.1, scaled_radius_of_gyration.1, skewness_about, skewness_about.1.
#Checking for the outliers using boxplot
i = 0
List=list(newdf.iloc[:,0:18].columns.values)
n=len(List)
plt.figure(figsize= (12,30))
while i<n:
    New_List=List[i]
    plt.subplot(9,2,i+1)
    sns.boxplot(x=newdf[New_List])
    i=i+1
plt.show()
Outlier Treatment using a new Data frame cleandf
Q1 = newdf.quantile(0.25)
Q3 = newdf.quantile(0.75)
IQR = Q3 - Q1
print('upper_value:\n',Q3+1.5*IQR)
print('\nLower value:\n',Q1-1.5*IQR)
cleandf = newdf[~((newdf < (Q1 - 1.5 * IQR)) |(newdf > (Q3 + 1.5 * IQR))).any(axis=1)]
cleandf
print('Checking the presence of outliers of Quantitative Data in the clean Dataframe post outlier treatment')
i = 0
total_outliers=0
List=list(cleandf.columns.values)
n=len(List)
while i < n:
    New_List=List[i]
    minimum,q1,q3,maximum= np.percentile(cleandf[New_List],[0,25,75,100])
    iqr=q3-q1
    lower_value=q1-(1.5 * iqr)
    upper_value=q3+(1.5 * iqr)
    if ((minimum<lower_value) or (maximum>upper_value)):
        outliers = [x for x in cleandf[New_List] if x < lower_value or x > upper_value]
        print('Identified outliers for',New_List,'out of', len(cleandf[New_List]),'records: %d' % len(outliers))
        total_outliers=total_outliers+len(outliers)
    else:
        print('There is no outlier for the attribute',New_List)
    i=i+1
print('Total number of outliers are:',total_outliers)
Most of the outliers are removed. The one outlier remaining in scaled_variance.1 appears only because the quartiles shifted after filtering; it was not an outlier in the original data and can be ignored.
#Checking for the outliers in cleandf using boxplot
i = 0
List=list(cleandf.iloc[:,0:18].columns.values)
n=len(List)
plt.figure(figsize= (12,30))
while i<n:
    New_List=List[i]
    plt.subplot(9,2,i+1)
    sns.boxplot(x=cleandf[New_List])
    i=i+1
plt.show()
We can proceed with either newdf (outliers retained) or cleandf (outliers treated). The outliers are few and not unrealistic, so we need not remove them: a prediction model should represent the real world, and keeping them improves the generalizability and robustness of the model. We therefore proceed with newdf.
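As a cross-check on this decision, the zscore helper imported earlier offers an alternative to the IQR rule: flag rows whose absolute z-score exceeds 3 in any column. A minimal sketch on a hypothetical frame (the `sample` data below is illustrative, not the vehicle data):

```python
import numpy as np
import pandas as pd
from scipy.stats import zscore

# Hypothetical numeric frame standing in for newdf: 19 typical values and one extreme.
sample = pd.DataFrame({'radius_ratio': [10]*10 + [11]*9 + [100]})

z = np.abs(zscore(sample))            # column-wise z-scores
outlier_rows = (z > 3).any(axis=1)    # rows extreme in at least one column
print('Rows flagged by |z| > 3:', int(outlier_rows.sum()))
```

On roughly normal columns the z-score and IQR rules tend to agree; the z-score rule is mainly sensitive to the extreme tail.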
# Histogram Plot of Quantitative Data
i = 0
List=list(newdf.iloc[:,0:18].columns.values)
n=len(List)
plt.figure(figsize= (12,30))
while i<n:
    New_List=List[i]
    plt.subplot(9,2,i+1)
    plt.hist(newdf[New_List],edgecolor = 'black')
    plt.xlabel(New_List)
    i=i+1
plt.show()
Quick Observations:
Most of the attributes appear roughly normally distributed.
Several attributes are right-skewed, as noted earlier in the skewness check.
# Strip Plot of Quantitative Data
i = 0
List=list(newdf.iloc[:,0:18].columns.values)
n=len(List)
plt.figure(figsize= (12,30))
while i<n:
    New_List=List[i]
    plt.subplot(9, 2, i+1)
    sns.stripplot(x=df['class'], y=df[New_List])
    plt.xlabel(New_List)
    i=i+1
plt.show()
#We will use Pearson Correlation Coefficient to see what all attributes are linearly related
newdf.corr()
plt.figure(figsize=(15,15))
sns.heatmap(newdf.corr(),annot=True,square=True,fmt='.2f')
plt.show()
print('Insights From Correlation Heatmap:\n');
print('Attributes with high correlation of greater than 0.9 or less than -0.9:\n');
a=newdf.corr()
i = 0
j = 0
c = 0
n=len(a.columns)
col=list(a.columns.values)
ind=list(a.index.values)
while i < n:
    sInd=ind[i]
    while j < n:
        sCol=col[j]
        value=a.loc[sInd,sCol]
        if(((value>0.9) or (value<-0.9))&(sInd!=sCol)):
            print('Correlation between',sInd,'&',sCol,'is',a.loc[sInd,sCol])
        j=j+1
    c=c+1
    j = c
    i=i+1
print('Insights From Correlation Heatmap:\n');
print('Attributes with low correlation of -0.3 to 0.3:\n');
a=newdf.corr()
i = 0
c = 0
j = 0
n=len(a.columns)
col=list(a.columns.values)
ind=list(a.index.values)
while i < n:
    sInd=ind[i]
    while j < n:
        sCol=col[j]
        value=a.loc[sInd,sCol]
        if((value<.3)&(sInd!=sCol)&(value>-0.3)):
            print('Correlation between',sInd,'&',sCol,'is',a.loc[sInd,sCol])
        j = j+1
    c = c+1
    j = c
    i = i+1
If two features are highly correlated, there is little value in keeping both, so we can drop one of them. The seaborn heatmap of the correlation matrix shows which features are highly correlated.
From the correlation matrix above we can see that many feature pairs have a correlation above 0.9 in absolute value, so we can drop one column from each pair whose correlation magnitude is 0.9 or more.
sns.pairplot(newdf,hue='class' ,diag_kind="kde")
plt.show()
Observations:
We see the same pattern as in the correlation heatmap: a few attributes are highly positively or negatively correlated.
Based on these correlation values we can drop one attribute from each highly correlated pair; per the analysis, the attributes below can be dropped:
max.length_rectangularity
scaled_radius_of_gyration
distance_circularity
elongatedness
pr.axis_rectangularity
scaled_variance
scaled_variance.1
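The drop itself can be scripted rather than done by eye: scan the upper triangle of the correlation matrix and collect one column from each pair with |correlation| > 0.9. A minimal sketch on a toy frame (`a`, `b`, `c` are illustrative columns, not the vehicle attributes):

```python
import numpy as np
import pandas as pd

# Toy frame with one nearly collinear pair (b ~ 2a), standing in for newdf.
rng = np.random.default_rng(0)
a = rng.normal(size=100)
df = pd.DataFrame({'a': a,
                   'b': a * 2 + rng.normal(scale=0.01, size=100),
                   'c': rng.normal(size=100)})

corr = df.corr().abs()
# Keep only the upper triangle so each pair is inspected once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
reduced = df.drop(columns=to_drop)
print('Columns dropped:', to_drop)
```

Scanning only the upper triangle ensures that from each correlated pair exactly one column (the later one) is dropped, never both.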
First let us train the model on the raw data and then use PCA to decide on the dimensionality reduction.
#Split our data into train and test data set
seed=12
x=newdf.drop(['class'],axis=1)
y=newdf['class']
x_train_df,x_test_df,y_train,y_test=train_test_split(x,y,test_size=0.3,random_state=seed)
x_train_df
y_train
#Checking the split of the data
print("{0:0.2f}% data is in training set".format((len(x_train_df)/len(newdf)) * 100))
print("{0:0.2f}% data is in test set".format((len(x_test_df)/len(newdf)) * 100))
print("Original class bus Values : {0} ({1:0.2f}%)".format(len(newdf.loc[newdf['class'] == 0]), (len(newdf.loc[newdf['class'] == 0])/len(newdf)) * 100))
print("Original class car Values : {0} ({1:0.2f}%)".format(len(newdf.loc[newdf['class'] == 1]), (len(newdf.loc[newdf['class'] == 1])/len(newdf)) * 100))
print("Original class van Values : {0} ({1:0.2f}%)".format(len(newdf.loc[newdf['class'] == 2]), (len(newdf.loc[newdf['class'] == 2])/len(newdf)) * 100))
print("")
print("Training class bus Values : {0} ({1:0.2f}%)".format(len(y_train[y_train[:] == 0]), (len(y_train[y_train == 0])/len(y_train)) * 100))
print("Training class car Values : {0} ({1:0.2f}%)".format(len(y_train[y_train[:] == 1]), (len(y_train[y_train == 1])/len(y_train)) * 100))
print("Training class van Values : {0} ({1:0.2f}%)".format(len(y_train[y_train[:] == 2]), (len(y_train[y_train == 2])/len(y_train)) * 100))
print("")
print("Test class bus Values : {0} ({1:0.2f}%)".format(len(y_test[y_test[:] == 0]), (len(y_test[y_test == 0])/len(y_test)) * 100))
print("Test class car Values : {0} ({1:0.2f}%)".format(len(y_test[y_test[:] == 1]), (len(y_test[y_test == 1])/len(y_test)) * 100))
print("Test class van Values : {0} ({1:0.2f}%)".format(len(y_test[y_test[:] == 2]), (len(y_test[y_test == 2])/len(y_test)) * 100))
print("")
#Scale the features: fit the scaler on the training set only and reuse it on the test set, to avoid leaking test-set statistics
scaler = preprocessing.StandardScaler()
x_train = pd.DataFrame(scaler.fit_transform(x_train_df), columns=x_train_df.columns)
x_test = pd.DataFrame(scaler.transform(x_test_df), columns=x_test_df.columns)
4a. Linear Support Vector Machine
#Linear Support vector Machine
lsvm = SVC(kernel='linear',random_state=seed)
lsvm.fit(x_train, y_train)
print('Train Data Score :',np.round(lsvm.score(x_train, y_train),4))
print('Test Data Score :',np.round(lsvm.score(x_test, y_test),4))
#Predict for train set
pred_train = lsvm.predict(x_train)
#Confusion Matrix
lsvm_cm_train = pd.DataFrame(confusion_matrix(y_train,pred_train).T,index=['Bus', 'Car','Van'], columns=['Bus', 'Car','Van'])
lsvm_cm_train.index.name = "Predicted"
lsvm_cm_train.columns.name = "True"
lsvm_cm_train
#Predict for test set
pred_test = lsvm.predict(x_test)
#Confusion Matrix
lsvm_cm_test = pd.DataFrame(confusion_matrix(y_test,pred_test).T,index=['Bus', 'Car','Van'], columns=['Bus', 'Car','Van'])
lsvm_cm_test.index.name = "Predicted"
lsvm_cm_test.columns.name = "True"
lsvm_cm_test
plt.figure(figsize = (4,4))
plt.title("Confusion Matrix for Linear Support vector machine for test data: \n")
ax=sns.heatmap(lsvm_cm_test, annot=True,fmt='g')
bottom, top = ax.get_ylim()
ax.set_ylim(bottom + 0.5, top - 0.5)
plt.show()
#summarize the fit of the model
lsvm_accuracy = np.round( metrics.accuracy_score( y_test, pred_test ), 4 )
#lsvm_precision = np.round( metrics.precision_score( y_test, pred_test ,average=None), 4 )
#lsvm_recall = np.round( metrics.recall_score( y_test, pred_test,average=None ), 4 )
#lsvm_f1score = np.round( metrics.f1_score( y_test, pred_test,average=None ), 4 )
print( 'Total Accuracy : ', lsvm_accuracy)
print('\n')
print('Metrics Classification Report for linear Support vector machine regression\n',metrics.classification_report(y_test, pred_test))
4b. Poly Support Vector Machine
psvm = SVC(kernel='poly',random_state=seed,gamma='scale')
psvm.fit(x_train, y_train)
#Poly Support Vector machine
print('Train Data Score :',np.round(psvm.score(x_train, y_train),4))
print('Test Data Score :',np.round(psvm.score(x_test, y_test),4))
#Predict for train set
pred_train = psvm.predict(x_train)
#Confusion Matrix
psvm_cm_train = pd.DataFrame(confusion_matrix(y_train,pred_train).T,index=['Bus', 'Car','Van'], columns=['Bus', 'Car','Van'])
psvm_cm_train.index.name = "Predicted"
psvm_cm_train.columns.name = "True"
psvm_cm_train
#Predict for test set
pred_test = psvm.predict(x_test)
#Confusion Matrix
psvm_cm_test = pd.DataFrame(confusion_matrix(y_test,pred_test).T,index=['Bus', 'Car','Van'], columns=['Bus', 'Car','Van'])
psvm_cm_test.index.name = "Predicted"
psvm_cm_test.columns.name = "True"
psvm_cm_test
plt.figure(figsize = (4,4))
plt.title("Confusion Matrix for poly Support vector machine for test data: \n")
ax=sns.heatmap(psvm_cm_test, annot=True,fmt='g')
bottom, top = ax.get_ylim()
ax.set_ylim(bottom + 0.5, top - 0.5)
plt.show()
#summarize the fit of the model
psvm_accuracy = np.round( metrics.accuracy_score( y_test, pred_test ), 4 )
#psvm_precision = np.round( metrics.precision_score( y_test, pred_test ,average=None), 4 )
#psvm_recall = np.round( metrics.recall_score( y_test, pred_test,average=None ), 4 )
#psvm_f1score = np.round( metrics.f1_score( y_test, pred_test,average=None ), 4 )
print( 'Total Accuracy : ', psvm_accuracy)
print('\n')
print('Metrics Classification Report for poly Support vector machine regression\n',metrics.classification_report(y_test, pred_test))
4c. Radial basis function Support vector machine
rsvm = SVC(kernel='rbf',random_state=seed,gamma='scale')
rsvm.fit(x_train, y_train)
#rbf Support Vector machine
print('Train Data Score :',np.round(rsvm.score(x_train, y_train),4))
print('Test Data Score :',np.round(rsvm.score(x_test, y_test),4))
#Predict for train set
pred_train = rsvm.predict(x_train)
#Confusion Matrix
rsvm_cm_train = pd.DataFrame(confusion_matrix(y_train,pred_train).T,index=['Bus', 'Car','Van'], columns=['Bus', 'Car','Van'])
rsvm_cm_train.index.name = "Predicted"
rsvm_cm_train.columns.name = "True"
rsvm_cm_train
#Predict for test set
pred_test = rsvm.predict(x_test)
#Confusion Matrix
rsvm_cm_test = pd.DataFrame(confusion_matrix(y_test,pred_test).T,index=['Bus', 'Car','Van'], columns=['Bus', 'Car','Van'])
rsvm_cm_test.index.name = "Predicted"
rsvm_cm_test.columns.name = "True"
rsvm_cm_test
plt.figure(figsize = (4,4))
plt.title("Confusion Matrix for Radial basis function Support vector machine for test data: \n")
ax=sns.heatmap(rsvm_cm_test, annot=True,fmt='g')
bottom, top = ax.get_ylim()
ax.set_ylim(bottom + 0.5, top - 0.5)
plt.show()
#summarize the fit of the model
rsvm_accuracy = np.round( metrics.accuracy_score( y_test, pred_test ), 4 )
#rsvm_precision = np.round( metrics.precision_score( y_test, pred_test ,average=None), 4 )
#rsvm_recall = np.round( metrics.recall_score( y_test, pred_test,average=None ), 4 )
#rsvm_f1score = np.round( metrics.f1_score( y_test, pred_test,average=None ), 4 )
print( 'Total Accuracy : ', rsvm_accuracy)
print('\n')
print('Metrics Classification Report for Radial basis function Support vector machine regression\n',metrics.classification_report(y_test, pred_test))
4d. Comparing SVM Accuracy scores for different Kernel
SVMresult_Before_PCA = pd.DataFrame({'Model' : ['SVM Linear', 'SVM Polynomial', 'SVM RBF'],
'Model Accuracy Before PCA' : [lsvm_accuracy, psvm_accuracy, rsvm_accuracy],
})
SVMresult_Before_PCA
Insights:
From the above results, the RBF-kernel SVM trained on the raw data gives higher accuracy than the linear and polynomial SVM models.
#K fold Cross validation using the K- Fold value as 10, in the Linear SVM Model
kf=KFold(n_splits= 10, shuffle=True, random_state = seed)
lsvm_results = cross_val_score(estimator = lsvm, X = x_train, y = y_train, cv = kf)
lsvm_kf_accuracy=lsvm_results.mean()
print(lsvm_kf_accuracy)
#K fold Cross validation in the Polynomial SVM Model
psvm_results = cross_val_score(estimator = psvm, X = x_train, y = y_train, cv = kf)
psvm_kf_accuracy=psvm_results.mean()
print(psvm_kf_accuracy)
#K fold Cross validation in the RBF SVM Model
rsvm_results = cross_val_score(estimator = rsvm, X = x_train, y = y_train, cv = kf)
rsvm_kf_accuracy=rsvm_results.mean()
print(rsvm_kf_accuracy)
Cross_Validation_Score_Before_PCA = pd.DataFrame({'Model' : ['SVM Linear KF', 'SVM Polynomial KF', 'SVM RBF KF'],
' Cross validation Score Before PCA' : [lsvm_kf_accuracy, psvm_kf_accuracy, rsvm_kf_accuracy],
})
Cross_Validation_Score_Before_PCA
Insights:
From the above results, the RBF-kernel SVM trained on the raw data gives the highest average accuracy under K-fold cross-validation, compared to the linear and polynomial SVM models.
We will perform PCA in the following steps:
Split our data into train and test data set
Normalize the training set using a standard scaler.
Calculate the covariance matrix.
Calculate the eigenvectors and their eigenvalues.
Sort the eigenvectors according to their eigenvalues in descending order.
Choose the first K eigenvectors (where k is the dimension we'd like to end up with).
Build new dataset with reduced dimensionality.
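The covariance and eigen-decomposition steps above can be verified directly with NumPy before handing the work to sklearn. A minimal sketch on synthetic standardized data (the toy `X` below is illustrative, not the vehicle features): the covariance eigenvalues, sorted in descending order, should match sklearn's `explained_variance_`.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 5))
X = (X - X.mean(axis=0)) / X.std(axis=0)   # standardize, as StandardScaler would

# Covariance matrix, eigen-decomposition, sort eigenvalues descending.
cov = np.cov(X, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)     # eigh: for symmetric matrices
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Project onto the first k eigenvectors to build the reduced dataset.
k = 2
X_reduced = X @ eigvecs[:, :k]

# Cross-check against sklearn's PCA on the same data.
pca = PCA(n_components=k).fit(X)
print(np.allclose(eigvals[:k], pca.explained_variance_))  # True
```

Both use the sample covariance (ddof=1), so the eigenvalues agree exactly up to floating-point tolerance; the projected columns may differ only in sign, since an eigenvector is defined up to a flip.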
#Split our data into train and test data set
x_PCA=newdf.drop(['class'],axis=1)
y_PCA=newdf['class']
x_train_df_PCA,x_test_df_PCA,y_train_PCA,y_test_PCA=train_test_split(x_PCA,y_PCA,test_size=0.3,random_state=seed)
#Fit the scaler on the training set only and reuse it on the test set, to avoid leaking test-set statistics
scaler = preprocessing.StandardScaler()
x_train_PCA = pd.DataFrame(scaler.fit_transform(x_train_df_PCA), columns=x_train_df_PCA.columns)
x_test_PCA = pd.DataFrame(scaler.transform(x_test_df_PCA), columns=x_test_df_PCA.columns)
shape=x_train_PCA.shape #Provides the shape in (Rows, Columns) of the scaled training frame
print('shape of the data frame is =',shape)
#Checking the split of the data
print("{0:0.2f}% data is in training set".format((len(x_train_PCA)/len(newdf)) * 100))
print("{0:0.2f}% data is in test set".format((len(x_test_PCA)/len(newdf)) * 100))
#Calculation of CovMatrix
covMatrix = np.cov(x_train_PCA,rowvar=False)
print(covMatrix)
pca = PCA(n_components=18,random_state=seed)
pca.fit(x_train_PCA)
#The eigen Values
print(pca.explained_variance_)
#The eigen Vectors
print(pca.components_)
#percentage of variation explained by each eigen Vector
print(pca.explained_variance_ratio_)
plt.bar(list(range(1,19)),pca.explained_variance_ratio_,alpha=0.5, align='center')
plt.ylabel('Variation explained')
plt.xlabel('eigen Value')
plt.show()
plt.step(list(range(1,19)),np.cumsum(pca.explained_variance_ratio_), where='mid')
plt.ylabel('Cum Sum of variation explained')
plt.xlabel('eigen Value')
plt.show()
i=1
while i<19:
    pca = PCA(n_components=i,random_state=seed)
    pca.fit(x_train_PCA)
    print('With',i,'principal components the model captures around',np.round(pca.explained_variance_ratio_.sum()*100,2),'percent of the variance in the data')
    i=i+1
Insights:
With 7 principal components the model captures more than 95% of the variance in the data, so we set the number of components to 7 and create a new array x_train_pca7.
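Instead of looping over component counts, sklearn can pick the count for us: passing a float to `n_components` keeps the smallest number of components whose cumulative explained variance reaches that fraction. A minimal sketch on toy data (illustrative, not the vehicle features):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 10))
X[:, 5:] *= 0.05   # make the last 5 columns nearly uninformative

# A float n_components keeps just enough components to explain 95% of variance.
pca = PCA(n_components=0.95, random_state=0)
X_reduced = pca.fit_transform(X)
print('Components kept:', pca.n_components_)
print('Variance captured: %.3f' % pca.explained_variance_ratio_.sum())
```

Here the five informative columns carry almost all the variance, so the fitted `n_components_` lands at 5 and the reduced data keeps that many columns.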
#n_Components=7 capture about 95% of the variance in the data
pca7 = PCA(n_components=7)
pca7.fit(x_train_PCA)
x_train_pca7 = pca7.transform(x_train_PCA)
pd.DataFrame(x_train_pca7)
#Project the test set with the PCA already fitted on the training data (do not refit on the test set)
x_test_pca7 = pca7.transform(x_test_PCA)
pd.DataFrame(x_test_pca7)
7a. Train a Support vector machine using the train set and get the accuracy on the test set on the Principal Component
PCA_x_train=x_train_pca7
PCA_y_train=y_train_PCA
PCA_x_test=x_test_pca7
PCA_y_test=y_test_PCA
#Linear Support vector Machine for Principal component
PCA_lsvm = SVC(kernel='linear',random_state=seed)
PCA_lsvm.fit(PCA_x_train,PCA_y_train)
print('Train Data Score :',np.round(PCA_lsvm.score(PCA_x_train, PCA_y_train),4))
print('Test Data Score :',np.round(PCA_lsvm.score(PCA_x_test, PCA_y_test),4))
#Predict for PCA train set
pred_train = PCA_lsvm.predict(PCA_x_train)
#Confusion Matrix
PCA_lsvm_cm_train = pd.DataFrame(confusion_matrix(PCA_y_train,pred_train).T,index=['Bus', 'Car','Van'], columns=['Bus', 'Car','Van'])
PCA_lsvm_cm_train.index.name = "Predicted"
PCA_lsvm_cm_train.columns.name = "True"
PCA_lsvm_cm_train
#Predict for PCA test set
pred_test = PCA_lsvm.predict(PCA_x_test)
#Confusion Matrix
PCA_lsvm_cm_test = pd.DataFrame(confusion_matrix(PCA_y_test,pred_test).T,index=['Bus', 'Car','Van'], columns=['Bus', 'Car','Van'])
PCA_lsvm_cm_test.index.name = "Predicted"
PCA_lsvm_cm_test.columns.name = "True"
PCA_lsvm_cm_test
plt.figure(figsize = (4,4))
plt.title("Confusion Matrix for Linear Support vector machine for test data: \n")
ax=sns.heatmap(PCA_lsvm_cm_test, annot=True,fmt='g')
bottom, top = ax.get_ylim()
ax.set_ylim(bottom + 0.5, top - 0.5)
plt.show()
#summarize the fit of the model
PCA_lsvm_accuracy = np.round( metrics.accuracy_score( PCA_y_test, pred_test ), 4 )
#PCA_lsvm_precision = np.round( metrics.precision_score( PCA_y_test, pred_test ,average=None), 4 )
#PCA_lsvm_recall = np.round( metrics.recall_score( PCA_y_test, pred_test,average=None ), 4 )
#PCA_lsvm_f1score = np.round( metrics.f1_score( PCA_y_test, pred_test,average=None ), 4 )
print( 'Total Accuracy : ', PCA_lsvm_accuracy)
print('\n')
print('Metrics Classification Report for Linear Support vector machine regression\n',metrics.classification_report(PCA_y_test, pred_test))
#Poly Support vector Machine for Principal component
PCA_psvm = SVC(kernel='poly',gamma='auto',random_state=seed)
PCA_psvm.fit(PCA_x_train,PCA_y_train)
print('Train Data Score :',np.round(PCA_psvm.score(PCA_x_train, PCA_y_train),4))
print('Test Data Score :',np.round(PCA_psvm.score(PCA_x_test, PCA_y_test),4))
#Predict for PCA train set
pred_train = PCA_psvm.predict(PCA_x_train)
#Confusion Matrix
PCA_psvm_cm_train = pd.DataFrame(confusion_matrix(PCA_y_train,pred_train).T,index=['Bus', 'Car','Van'], columns=['Bus', 'Car','Van'])
PCA_psvm_cm_train.index.name = "Predicted"
PCA_psvm_cm_train.columns.name = "True"
PCA_psvm_cm_train
#Predict for PCA test set
pred_test = PCA_psvm.predict(PCA_x_test)
#Confusion Matrix
PCA_psvm_cm_test = pd.DataFrame(confusion_matrix(PCA_y_test,pred_test).T,index=['Bus', 'Car','Van'], columns=['Bus', 'Car','Van'])
PCA_psvm_cm_test.index.name = "Predicted"
PCA_psvm_cm_test.columns.name = "True"
PCA_psvm_cm_test
plt.figure(figsize = (4,4))
plt.title("Confusion Matrix for Poly Support vector machine for test data: \n")
ax=sns.heatmap(PCA_psvm_cm_test, annot=True,fmt='g')
bottom, top = ax.get_ylim()
ax.set_ylim(bottom + 0.5, top - 0.5)
plt.show()
#summarize the fit of the model
PCA_psvm_accuracy = np.round( metrics.accuracy_score( PCA_y_test, pred_test ), 4 )
#PCA_psvm_precision = np.round( metrics.precision_score( PCA_y_test, pred_test ,average=None), 4 )
#PCA_psvm_recall = np.round( metrics.recall_score( PCA_y_test, pred_test,average=None ), 4 )
#PCA_psvm_f1score = np.round( metrics.f1_score( PCA_y_test, pred_test,average=None ), 4 )
print( 'Total Accuracy : ', PCA_psvm_accuracy)
print('\n')
print('Metrics Classification Report for Poly Support vector machine regression\n',metrics.classification_report(PCA_y_test, pred_test))
#RBF Support vector Machine for Principal component
PCA_rsvm = SVC(kernel='rbf',random_state=seed,gamma='auto')
PCA_rsvm.fit(PCA_x_train,PCA_y_train)
print('Train Data Score :',np.round(PCA_rsvm.score(PCA_x_train, PCA_y_train),4))
print('Test Data Score :',np.round(PCA_rsvm.score(PCA_x_test, PCA_y_test),4))
#Predict for PCA train set
pred_train = PCA_rsvm.predict(PCA_x_train)
#Confusion Matrix
PCA_rsvm_cm_train = pd.DataFrame(confusion_matrix(PCA_y_train,pred_train).T,index=['Bus', 'Car','Van'], columns=['Bus', 'Car','Van'])
PCA_rsvm_cm_train.index.name = "Predicted"
PCA_rsvm_cm_train.columns.name = "True"
PCA_rsvm_cm_train
#Predict for PCA test set
pred_test = PCA_rsvm.predict(PCA_x_test)
#Confusion Matrix
PCA_rsvm_cm_test = pd.DataFrame(confusion_matrix(PCA_y_test,pred_test).T,index=['Bus', 'Car','Van'], columns=['Bus', 'Car','Van'])
PCA_rsvm_cm_test.index.name = "Predicted"
PCA_rsvm_cm_test.columns.name = "True"
PCA_rsvm_cm_test
plt.figure(figsize = (4,4))
plt.title("Confusion Matrix for RBF Support vector machine for test data: \n")
ax=sns.heatmap(PCA_rsvm_cm_test, annot=True,fmt='g')
bottom, top = ax.get_ylim()
ax.set_ylim(bottom + 0.5, top - 0.5)
plt.show()
#summarize the fit of the model
PCA_rsvm_accuracy = np.round( metrics.accuracy_score( PCA_y_test, pred_test ), 4 )
#PCA_rsvm_precision = np.round( metrics.precision_score( PCA_y_test, pred_test ,average=None), 4 )
#PCA_rsvm_recall = np.round( metrics.recall_score( PCA_y_test, pred_test,average=None ), 4 )
#PCA_rsvm_f1score = np.round( metrics.f1_score( PCA_y_test, pred_test,average=None ), 4 )
print( 'Total Accuracy : ', PCA_rsvm_accuracy)
print('\n')
print('Metrics Classification Report for RBF Support vector machine regression\n',metrics.classification_report(PCA_y_test, pred_test))
SVMresult_After_PCA = pd.DataFrame({'Model' : ['SVM Linear', 'SVM Polynomial', 'SVM RBF'],
'Model Accuracy After PCA' : [PCA_lsvm_accuracy, PCA_psvm_accuracy, PCA_rsvm_accuracy],
})
SVMresult_After_PCA
Insights:
From the above results, the RBF-kernel SVM trained on the principal components gives higher accuracy than the linear and polynomial SVM models.
7b. Perform K-fold cross validation and get the cross validation score of the model for Principal Component
#K fold Cross validation
PCA_kf=KFold(n_splits= 10, shuffle=True, random_state = seed)
PCA_lsvm_results = cross_val_score(estimator = PCA_lsvm, X = PCA_x_train, y = PCA_y_train, cv = PCA_kf)
PCA_lsvm_PCA_kf_accuracy=PCA_lsvm_results.mean()
print(PCA_lsvm_PCA_kf_accuracy)
#K fold Cross validation
PCA_psvm_results = cross_val_score(estimator = PCA_psvm, X = PCA_x_train, y = PCA_y_train, cv = PCA_kf)
PCA_psvm_PCA_kf_accuracy=PCA_psvm_results.mean()
print(PCA_psvm_PCA_kf_accuracy)
#K fold Cross validation
PCA_rsvm_results = cross_val_score(estimator = PCA_rsvm, X = PCA_x_train, y = PCA_y_train, cv = PCA_kf)
PCA_rsvm_PCA_kf_accuracy=PCA_rsvm_results.mean()
print(PCA_rsvm_PCA_kf_accuracy)
Cross_Validation_Score_After_PCA = pd.DataFrame({'Model' : ['SVM Linear KF', 'SVM Polynomial KF', 'SVM RBF KF'],
' Cross validation Score After PCA' : [PCA_lsvm_PCA_kf_accuracy, PCA_psvm_PCA_kf_accuracy, PCA_rsvm_PCA_kf_accuracy],
})
Cross_Validation_Score_After_PCA
Insights:
From the above results, the RBF-kernel SVM trained on the 7 principal components gives the highest average accuracy under K-fold cross-validation, compared to the linear and polynomial SVM models.
svm_result=pd.merge(SVMresult_Before_PCA, SVMresult_After_PCA,on='Model')
svm_result
From the above results, we can see that the RBF SVM model has the highest accuracy of the three models. Reducing the dimensionality from 18 to 7 cost around 11% in model accuracy.
Cross_Validation_Score_Result=pd.merge(Cross_Validation_Score_Before_PCA, Cross_Validation_Score_After_PCA,on='Model')
Cross_Validation_Score_Result
From the cross-validation scores above, the RBF SVM model again has the highest average accuracy of the three models. Reducing the dimensionality from 18 to 7 cost only a 4-5% drop in average accuracy.